Clustering Template Based Web Documents

Author

  • Thomas Gottron
Abstract

More and more documents on the World Wide Web are based on templates. On a technical level this causes those documents to have very similar source code and DOM tree structures. Grouping together documents which are based on the same template is an important task for applications that analyse the template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. Combining those distance measures with different approaches for clustering, we show which combination of methods leads to the desired result.

As more and more documents on the World Wide Web are generated automatically by Content Management Systems (CMS), more and more of them are based on templates. Templates can be seen as framework documents which are filled with different contents to compile the final documents. They are a standard (if not essential) CMS technology. Templates provide the managed web sites with an easy-to-manage, uniform look and feel. A technical side effect is that the source code of template-generated documents is always very similar. Several algorithms have been developed to automatically detect these template structures in order to identify and/or extract particular parts of a document, such as the main content. These structure detection algorithms depend on training sets of documents which are all supposed to be based on the same template. However, only a few works address the problem of actually creating these clean training sets or of verifying that the documents in a given training set are all based on the same template. Approaches to this problem usually involve clustering the documents to group together those which have large structural similarities. To our knowledge, however, this clustering process has itself never been analysed or verified.

In this paper we take a closer look at web document distance measures which are supposed to reflect template-related structural similarities and dissimilarities. We evaluate the distance measures both in terms of computational cost and, given different clustering approaches, in terms of how suitable they are for clustering documents according to their underlying templates. The evaluation is based on a corpus of 500 web documents, taken from different sub-categories of five different web sites. We proceed as follows: in section 1 we give an overview of related work in this field, focussing in particular on distance measures for web documents which mainly take structural information into account. In sections 2 and 3 we describe six different distance measures in more detail, along with some standard cluster analysis algorithms we used. The experiment setup and results are presented in section 4, before we conclude the paper in section 5 with a discussion of the results.
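
To make the idea of a structural distance measure concrete, the following is a minimal sketch and not one of the six measures evaluated in the paper (the abstract does not name them). It assumes, purely for illustration, that two documents can be compared by the Jaccard distance between their sets of root-to-element tag paths, and it groups documents with a simple single-linkage threshold rule. The names `tag_paths`, `structural_distance`, and `cluster_by_threshold`, and the threshold value `0.5`, are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths (e.g. 'html/body/div') found in html."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_distance(html_a, html_b):
    """Jaccard distance between tag-path sets (assumed measure, see lead-in).

    0.0 means identical path sets, 1.0 means no path in common.
    """
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def cluster_by_threshold(docs, threshold=0.5):
    """Single-linkage grouping: a document joins (and merges) every cluster
    that contains at least one document closer than the threshold."""
    clusters = []
    for i, doc in enumerate(docs):
        linked = [c for c in clusters
                  if any(structural_distance(doc, docs[j]) < threshold
                         for j in c)]
        merged = [i]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return clusters
```

Calling `cluster_by_threshold` on a list of HTML strings returns lists of document indices, one list per group. A set-based measure like this ignores how often a path occurs and the order of elements, so it is cheap to compute but coarse; sequence- or tree-based measures would trade higher computational cost for finer structural detail.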


Similar resources

Application of Ant-based Template Matching for Web Documents Categorization

The self-organization behavior exhibited by ants may be modeled to solve real world clustering problems. The general idea of artificial ants walking around in search space to pick up, or drop an item based upon some probability measure has been examined to cluster a large number of World Wide Web (WWW) documents. However, this idea is extended with the direct application of template matching wi...


Bridging the Gap: from Multi Document Template Detection to Single Document Content Extraction

Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to determine the main content. In this paper we propose a way to combine the reliability and theoretic underpinning of the first world with the single document based approach of the latter...


A Hybrid Approach to Statistical and Semantical Analysis of Web Documents

This paper describes a new approach to improve the analysis and categorization of web documents using statistical methods for template based clustering as well as semantical analysis based on terminological ontologies. A domain-specific environment serves as proof of concept. In order to demonstrate the widespread practical benefit of our approach, we outline a combined mathematical and semant...


A Methodology for Template Extraction from Heterogeneous Web Pages

The World Wide Web is a vast and most useful collection of information. To achieve high productivity in publishing the web pages are automatically evaluated using common templates with contents. The templates are considered harmful because they compromise the relevance judgement of many web information retrieval and web mining methods such as clustering and classification and badly impact the p...


A Novel Weighted Phrase-Based Similarity for Web Documents Clustering

Phrase has been considered as a more informative feature term for improving the effectiveness of document clustering. In this paper, a weighted phrase-based document similarity is proposed to compute the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The weighted phrase-based document similarity is applied to the Group-average Hierarchical Agglomerat...



Publication date: 2008